Abstract. The past decade has seen the growing popularity of Bag of Features (BoF) 
approaches to many computer vision tasks, including image classification, video search, 
robot localization, and texture recognition. Part of the appeal is simplicity. BoF meth- 
ods are based on orderless collections of quantized local image descriptors; they discard 
spatial information and are therefore conceptually and computationally simpler than many 
alternative methods. Despite this, or perhaps because of this, BoF-based systems have set 
new performance standards on popular image classification benchmarks and have achieved 
scalability breakthroughs in image retrieval. This paper presents an introduction to BoF 
image representations, describes critical design choices, and surveys the BoF literature. 
Emphasis is placed on recent techniques that mitigate quantization errors, improve fea- 
ture detection, and speed up image retrieval. At the same time, unresolved issues and 
fundamental challenges are raised. Among the unresolved issues are determining the best 
techniques for sampling images, describing local image features, and evaluating system 
performance. Among the more fundamental challenges are how and whether BoF meth- 
ods can contribute to localizing objects in complex images, or to associating high-level 
semantics with natural images. This survey should be useful both for introducing new in- 
vestigators to the field and for providing existing researchers with a consolidated reference 
to related work. 



1. Introduction 

The past decade has seen the rise of the Bag of Features approach in computer vision. 
Bag of Features (BoF) methods have been applied to image classification, object detection, 
image retrieval, and even visual localization for robots. BoF approaches are characterized 
by the use of an orderless collection of image features. Lacking any structure or spa- 
tial information, it is perhaps surprising that this choice of image representation would be 
powerful enough to match or exceed state-of-the-art performance in many of the appli- 
cations to which it has been applied. Due to its simplicity and performance, the Bag of 
Features approach has become well-established in the field. This paper seeks to character- 
ize the Bag of Features paradigm, providing a survey of relevant literature and a discussion 
of open research issues and challenges. We focus on the application of BoF to weakly 
supervised image classification and unsupervised image retrieval tasks. 

This survey is of interest to the computer vision community for three reasons. First, Bag 
of Features methods work. As discussed below, BoF-based systems have demonstrated 
comparable or better results than other approaches for image classification and image re- 
trieval, while being computationally cheaper and conceptually simpler. Second, the BoF 
approach has appeared under different names in several seemingly unrelated branches of 
the literature. Besides the computer vision literature, where the term "Bag of Features" 
was coined, closely related approaches appear in the literature on biological modeling, 
texture analysis, and robot localization. As a result, there is more work on BoF than many 
researchers may be aware of. Finally, the Bag of Features approach is a multi-step process, 
with each step presenting many options. Many plausible combinations have never been 
tried. We hope to contribute to future advances in the field through this survey by mapping 
out the space of BoF algorithms, recording what is known about the steps and how they 
interact, and identifying remaining research opportunities. 

Although there is no contemporary survey on Bag of Features methods, related surveys 
include Frintrop's survey of computational visual saliency (Frintrop et al., 2010) and 
Datta's overview of Content-Based Image Retrieval (CBIR) techniques (Datta et al., 2008). 
Frintrop reviews techniques that share similarities to the feature extraction stage of BoF 
methods, as discussed in Section 3 of this report. Datta's CBIR survey discusses BoF-based 
image retrieval to some extent, but we present a broader survey of BoF techniques 
with more details on state-of-the-art methods and specific mechanisms that have been em- 
ployed to improve query results and speed. 

This paper is organized as follows. Section 2 provides an overview of the Bag of Features 
image representation. Section 3 provides details on the feature detection and extraction 
techniques commonly employed in BoF representations. Vector Quantization is an 
important aspect of the BoF approach, and so Section 4 discusses the quantization challenges 
and how the BoF community has attempted to address them. Two popular applications 
of BoF representations are image classification and image retrieval, which are 
presented in Sections 5 and 6, respectively. Section 7 looks at the evaluation of BoF methods, 
including common performance measures, data sets, and the challenges involved with 
comparative evaluation. Although BoF methods have shown promising performance in a 
number of tasks, there remain open issues and inherent limitations. We explore a few of 
these in Section 8. Section 9 has our concluding remarks. 

2. Bag of Features Image Representation 

A Bag of Features method is one that represents images as orderless collections of 
local features. The name comes from the Bag of Words representation used in textual 
information retrieval. This section provides an explanation of the Bag of Features image 
representation, focusing on the high-level process independent of the application. More 
sophisticated variations and other implementation details are discussed later in this report. 

There are two common perspectives for explaining the BoF image representation. The 
first is by analogy to the Bag of Words representation. With Bag of Words, one represents a 
document as a normalized histogram of word counts. Commonly, one counts all the words 
from a dictionary that appear in the document. This dictionary may exclude certain non- 
informative words such as articles (like "the"), and it may have a single term to represent 
a set of synonyms. The term vector that represents the document is a sparse vector where 
each element is a term in the dictionary and the value of that element is the number of 
times the term appears in the document divided by the total number of dictionary words in 
the document (and thus, it is also a normalized histogram over the terms). The term vector 
is the Bag of Words document representation, called a "bag" because all ordering of the 
words in the document has been lost. 
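
As a minimal sketch of the term vector just described, the following Python fragment builds a sparse, normalized histogram of dictionary words for one document. The tokenization rule and the four-word dictionary are assumptions made purely for illustration.

```python
from collections import Counter

def bag_of_words(text, dictionary):
    """Return a sparse, normalized histogram over the dictionary terms in text."""
    # Naive tokenizer (an assumption): split on whitespace, strip punctuation.
    tokens = [t.strip('.,;:!?"()').lower() for t in text.split()]
    counts = Counter(t for t in tokens if t in dictionary)
    total = sum(counts.values())
    # Only terms that occur are stored, so the vector stays sparse.
    return {term: n / total for term, n in counts.items()} if total else {}

# Hypothetical dictionary that excludes articles such as "the".
dictionary = {"cat", "dog", "sat", "mat"}
vec = bag_of_words("The cat sat on the mat. The cat sat.", dictionary)
```

Word order plays no role in the result: any permutation of the same words yields the same vector.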

The Bag of Features image representation is analogous. A visual vocabulary is con- 
structed to represent the dictionary by clustering features extracted from a set of training 
images. The image features represent local areas of the image, just as words are local features 
of a document. Clustering is required so that a discrete vocabulary can be generated 
from millions (or billions) of local features sampled from the training data. Each feature 
cluster is a visual word. Given a novel image, features are detected and assigned to their 
nearest matching terms (cluster centers) from the visual vocabulary. The term vector is 
then simply the normalized histogram of the quantized features detected in the image. 

The second way to explain the BoF image representation is from a codebook perspec- 
tive. Features are extracted from training images and vector quantized to develop a visual 
codebook. A novel image's features are assigned the nearest code in the codebook. The 
image is reduced to the set of codes it contains, represented as a histogram. The normal- 
ized histogram of codes is exactly the same as the normalized histogram of visual words, 
yet is motivated from a different point of view. Both the "codebook" and "visual vocabulary" 
terminologies are present in the surveyed literature. 

The BoF term vector is a compact representation of an image which discards large- 
scale spatial information and the relative locations, scales, and orientations of the features. 
A contemporary large-scale BoF-based image retrieval system might have a dictionary of 
100,000 visual words and 5,000 features extracted per image. Thus in an image where there 
are no duplicate visual words (unusual), the term vector will have 95% of its elements as 
zeros. The strong sparsity of term vectors allows for efficient indexing schemes and other 
performance improvements, as discussed in later sections. 
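
The sparsity figure follows from a short calculation. Writing V for the vocabulary size and F for the number of features per image (symbols introduced here only for illustration), at most F elements of the term vector can be nonzero, with equality when no visual word is repeated:

```latex
\[
\frac{\#\,\text{zero elements}}{V} \;\ge\; 1 - \frac{F}{V}
  \;=\; 1 - \frac{5{,}000}{100{,}000} \;=\; 0.95
\]
```

Duplicate visual words only increase the fraction of zeros.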

At a high level, the procedure for generating a Bag of Features image representation is 
shown in Figure 1 and summarized as follows: 

(1) Build Vocabulary: Extract features from all images in a training set. Vector quan- 
tize, or cluster, these features into a "visual vocabulary," where each cluster rep- 
resents a "visual word" or "term." In some works, the vocabulary is called the 
"visual codebook." Terms in the vocabulary are the codes in the codebook. 

(2) Assign Terms: Extract features from a novel image. Use Nearest Neighbors or a 
related strategy to assign the features to the closest terms in the vocabulary. 

(3) Generate Term Vector: Record the counts of each term that appears in the image 
to create a normalized histogram representing a "term vector." This term vector 
is the Bag of Features representation of the image. Term vectors may also be 
represented in ways other than simple term frequency, as discussed later. 
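
The three steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation of any surveyed system: the "descriptors" are synthetic 2-D points, the clustering is plain k-means (Lloyd's algorithm), and the farthest-first initialization is a hypothetical choice made only to keep the example deterministic.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(feature, centers):
    """Index of the closest cluster center (hard assignment)."""
    return min(range(len(centers)), key=lambda i: sq_dist(feature, centers[i]))

def build_vocabulary(features, k, iters=20):
    """Step 1: cluster training features with k-means."""
    # Farthest-first initialization (an assumption, chosen for determinism).
    centers = [list(features[0])]
    while len(centers) < k:
        far = max(features, key=lambda f: min(sq_dist(f, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:
            groups[nearest(f, centers)].append(f)
        for i, g in enumerate(groups):
            if g:  # keep the old center if a cluster becomes empty
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def term_vector(features, vocabulary):
    """Steps 2 and 3: assign features to terms, then normalize the histogram."""
    counts = [0] * len(vocabulary)
    for f in features:
        counts[nearest(f, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy training "descriptors" scattered around two well-separated locations.
rng = random.Random(1)
train = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)] + \
        [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(50)]
vocab = build_vocabulary(train, k=2)
novel = [[0.1, -0.1], [4.9, 5.2], [5.1, 5.0], [0.0, 0.2]]
tv = term_vector(novel, vocab)
```

Real systems replace the toy points with high-dimensional descriptors and the exhaustive nearest-center search with approximate methods, but the structure of the computation is the same.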

There are a number of design choices involved at each step in the BoF representation. 
One key decision involves the choice of feature detection and representation. Many use 
an interest point operator, such as the Harris-Affine detector (Mikolajczyk et al., 2005) or 
the Maximally Stable Extremal Regions (MSER) detector (Matas et al., 2004). At every 
interest point, often a few thousand per image, a high-dimensional feature vector is used 
to describe the local image patch. Lowe's 128-dimension SIFT descriptor is a popular 
choice (Lowe, 1999). Other choices of feature detection and representation are discussed 
in Section 3. 

Another pair of design choices involve the method of vector quantization used to gen- 
erate the vocabulary and the distance measure used to assign features to cluster centers. A 
distance measure is also required when comparing two term vectors for similarity (as is 
done with image retrieval), but this measure operates in the term vector space as opposed 
to the feature space. Quantization issues and the choice of distance measure can impact 
term assignment and similarity scoring. These issues are discussed in Section 4. 

3. Feature Detection and Representation 



If you are going to represent an image as a Bag of Features, the features had better be 
good! This may be easier said than done. There are many possible approaches to sampling 
image features, including "Interest Point Operators," "Visual Saliency," and random or 
deterministic grid sampling. Further, the definition of what makes a good feature may be 
application-dependent, and a universally accepted measure of fitness for localized features 
has yet to be developed. 

Figure 1. Process for Bag of Features Image Representation 

3.1. Feature Detection. Feature detection is the process of deciding where and at what 
scale to sample an image. The output of feature detection is a set of keypoints that specify 
locations in the image with corresponding scales and orientations. These keypoints are 
distinct from feature descriptors, which encode information from the pixels in the neighborhood 
of the keypoints. Thus, feature detection is a separate process from feature representation 
in BoF approaches. Feature descriptors are presented in Section 3.2. 

There is a substantial body of literature that focuses on detecting the location and ex- 
tent of good features from at least two different sub-fields of computer vision. The first 
developed from the goal of finding keypoints useful for image registration that are stable 
under minor affine and photometric transformations. These feature detection methods are 
referred to as Interest Point Operators. The second group detects features based on compu- 
tational models of the human visual attention system. These methods are concerned with 
finding locations in images that are visually salient. In this case, fitness is often measured 
by how well the computational methods predict human eye fixations recorded by an eye 
tracker. In the following two subsections, we discuss both approaches to feature detection. 

Finally, there is research that suggests generating keypoints by sampling the images 
using a grid or pyramid structure, or even by random sampling. Deterministic and random 
sampling approaches are discussed in this section as well, below. 

3.1.1. Interest Point Operators. While there are many variations, an interest point operator 
typically detects keypoints using scale space representations of images. A scale space 
represents the image at multiple resolutions, and is generated by convolving the image with 
a set of Gaussian kernels spanning a range of σ values. The result is a data structure which 
is, among other things, a convenient way to efficiently apply image processing operations at 
multiple scales. For details on scale space representations, see Lindeberg (1993). Interest 
point operators detect locally discriminating features, such as corners, blob-like regions, 
or curves. Responses to a filter designed to detect these features are located in a three 
dimensional coordinate space, (x,y,s), where (x,y) is the pixel location and s is the scale. 
Extremal values for the responses over local (x,y,s) neighborhoods are identified as interest 
points. 
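
To make the (x,y,s) extremum search concrete, the following pure-Python sketch builds a small Gaussian scale space, forms Difference-of-Gaussians layers, and reports strict extrema of each 3x3x3 neighborhood. It is a simplified illustration only: the synthetic blob image, the four σ values, and the zero-padded borders are assumptions, and practical detectors add refinements such as sub-pixel localization and contrast thresholds.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma and normalized to sum to 1."""
    r = max(1, int(math.ceil(3 * sigma)))
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-r, r + 1)]
    s = sum(k)
    return [v / s for v in k]

def blur(img, sigma):
    """Separable Gaussian blur; pixels outside the image are treated as zero."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(img), len(img[0])
    rows = [[sum(k[d + r] * row[x + d] for d in range(-r, r + 1) if 0 <= x + d < w)
             for x in range(w)] for row in img]
    return [[sum(k[d + r] * rows[y + d][x] for d in range(-r, r + 1) if 0 <= y + d < h)
             for x in range(w)] for y in range(h)]

def dog_stack(img, sigmas):
    """Difference-of-Gaussians layers between successive scales."""
    blurred = [blur(img, s) for s in sigmas]
    return [[[b[y][x] - a[y][x] for x in range(len(img[0]))]
             for y in range(len(img))]
            for a, b in zip(blurred, blurred[1:])]

def local_extrema(stack):
    """Return (s, y, x) positions that are strict extrema of their 3x3x3 neighborhood."""
    found = []
    for s in range(1, len(stack) - 1):
        for y in range(1, len(stack[s]) - 1):
            for x in range(1, len(stack[s][y]) - 1):
                v = stack[s][y][x]
                nb = [stack[s + ds][y + dy][x + dx]
                      for ds in (-1, 0, 1) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                      if (ds, dy, dx) != (0, 0, 0)]
                if v > max(nb) or v < min(nb):
                    found.append((s, y, x))
    return found

# Synthetic 41x41 image with one Gaussian blob (sigma = 2) at the center.
size, c = 41, 20
image = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * 2.0 ** 2))
          for x in range(size)] for y in range(size)]
sigmas = [1.0, 1.6, 2.56, 4.096]
keypoints = local_extrema(dog_stack(image, sigmas))
```

For this image the blob center is detected in the middle DoG layer, the scale at which the filter best matches the blob size.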

Perhaps the most popular keypoint detector is that developed by Lowe (Lowe, 1999), 
which employs a Difference-of-Gaussians (DoG) filter for detection. DoG responses can 
be conveniently computed from a scale space structure, and extremal response values 
within a local (x,y,s) region are used as interest points. Kadir and Brady designed a keypoint 
detector called Scale Saliency (Kadir and Brady, 2001) to find regions which have high 
entropy within a local scale-space region. Another popular keypoint detector, called the 
Harris-Affine detector (Mikolajczyk and Schmid, 2004), extends the well-known Harris 
Corner Detector (Harris and Stephens, 1988) to a scale space representation with oriented 
elliptical regions. The Maximally Stable Extremal Regions (MSER) keypoint detector 
finds elliptical regions based on a watershed process (Matas et al., 2004). These are but a 
handful of examples of the many interest point operators designed to be robust to small 
affine and photometric image transformations, with the goal of being able to find the same 
keypoints in two similar but distinct images. 

In practice, the state-of-the-art Bag of Features methods tend to use similar feature 
detectors. Many use either the Harris-Affine or the MSER feature detectors. In large part, 
this is due to a study in 2005 of affine region detectors (Mikolajczyk et al., 2005), published 
as a collaborative effort by many leading researchers in the field. While the conclusions of 
this study are subject to interpretation, the Harris-Affine and MSER detectors performed 
well under a variety of situations. 



3.1.2. Visual Saliency. Bag of Features methods often rely on interest point operators to 
detect the location and scale of localized regions from which image features are extracted. 
Similarly, many biologically-inspired, or biomimetic, computer vision systems use localized 
regions as well. In the biomimetic computer vision literature, the interest point operator 
is based on computational models of visual attention. A recent survey by Frintrop et al. 
(Frintrop et al., 2010) provides details on many computational visual attention methods, so 
we do not reproduce that material here. Our intent is to point out the similarities between 
the BoF literature, which uses Interest Point Operators, and the visual saliency literature, 
with the hope that there will be more direct cross-referencing between the two fields in the 
future. 

Itti and Koch proposed a popular model which builds upon a thread of research started 
by Koch and Ullman in the 1980s (Itti and Koch, 2000; Koch and Ullman, 1985). The 
Itti and Koch saliency model, at a high level, looks for extrema in center-surround patterns 
across several feature channels, such as color opponency, orientation (Gabor filters), and 
intensity. The center-surround extrema can be detected using Laplacian-of-Gaussian (LoG) 
or Difference-of-Gaussian (DoG) filters, similar to the models of Lindeberg (Lindeberg, 
1993) and Lowe (Lowe, 1999). 



Bruce and Tsotsos present an information theoretic approach to saliency (Bruce and 
Tsotsos, 2009), which is conceptually similar to Kadir and Brady's Scale Saliency interest 
point operator mentioned earlier. A local region is interesting not just based upon a certain 
pattern or filter response, but also because it differs significantly from the surrounding 
context of neighboring pixels. 

Other computational models of visual saliency continue to be explored in the literature, 
with an interdisciplinary sub-field that is interested in which computational model best 
predicts human attentional fixations as measured with an eye-tracker (Elazary and Itti, 
2008; Kienzle et al., 2009; Tatler et al., 2005). 

3.1.3. Deterministic and Random Sampling. A more fundamental question than which 
interest point detector to use is whether or not to use an interest point detector at all. One 
may characterize the extraction of localized features as an image sampling problem. While 
keypoint operators are useful for image alignment, it is an open question whether this is 
the ideal way to sample localized features for image matching or classification. 

For example, in the Video Google paper (Sivic and Zisserman, 2003), the authors present 
a pair of similar images from the movie "Run Lola, Run" that fails to match well in a BoF 
image retrieval task. The images show Lola running down a sidewalk in an urban area. 
The images have high amounts of low texture concrete - sidewalk, building facade, and 
roadway. An illustration of keypoint locations shows that feature detection leaves about 
half of the images unsampled. While "uninteresting" to the feature detectors, the blandness 
of large portions of the images is in itself potentially highly discriminating. 

Maree et al. describe an image classification algorithm featuring random multiscale 
subwindows and ensembles of randomized decision trees (Maree et al., 2005). While the 
algorithm is not strictly a BoF approach, it illustrates the efficacy of random sampling. 
Nowak, Jurie, and Triggs explored sampling strategies for BoF image classification 
(Nowak et al., 2006). They show that when using enough samples, random sampling exceeds 
the performance of interest point operators. They present evidence that the most 
important factor is the number of patches sampled from the test image, and thus claim 
dense random sampling is the best strategy. 

Spatial Pyramid Matching (Lazebnik et al., 2006) uses SIFT descriptors extracted from 
a dense grid with a spacing of 8 pixels. K-means clustering is used for constructing the 
vocabulary. But instead of directly forming the term vector for a given image, the terms 
are collected in a pyramid of histograms, where the base level is equivalent to the standard 
BoF representation for the complete image. At each subsequent level in the pyramid, the 
image is divided into four subregions, in a recursive manner, with each region at each 
pyramid level having its own histogram (term vector). The distance between two images 
using this spatial pyramid representation is a weighted histogram intersection function, 
where weights are largest for the smallest regions. In doing so, Lazebnik et al. capture a degree 
of location information beyond the standard orderless BoF representation. 
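
The pyramid construction and the weighted intersection can be sketched as follows. This is an illustrative approximation rather than the authors' code: features are assumed to be already quantized into (x, y, word) triples, histograms are left as raw counts, and the level weights follow our reading of the scheme in Lazebnik et al. (2006), with level 0 weighted 1/2^L and level l weighted 1/2^(L-l+1).

```python
def pyramid_histograms(points, vocab_size, width, height, levels=2):
    """points: (x, y, word) triples. Returns one list of cell histograms per level."""
    pyramid = []
    for level in range(levels + 1):
        cells = 2 ** level  # cells per side; level 0 covers the whole image
        hists = [[0] * vocab_size for _ in range(cells * cells)]
        for x, y, word in points:
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            hists[cy * cells + cx][word] += 1
        pyramid.append(hists)
    return pyramid

def intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))

def pyramid_match(p1, p2):
    """Weighted histogram intersection; finer levels receive larger weights."""
    top = len(p1) - 1
    score = 0.0
    for level, (hs1, hs2) in enumerate(zip(p1, p2)):
        # Assumed weights: 1/2^L at level 0, 1/2^(L - l + 1) at level l >= 1.
        weight = 1 / 2 ** top if level == 0 else 1 / 2 ** (top - level + 1)
        score += weight * sum(intersection(a, b) for a, b in zip(hs1, hs2))
    return score

# Two toy images (100x100 pixels) with the same words but different layouts.
a = [(10, 10, 0), (90, 90, 1)]
b = [(12, 8, 0), (20, 85, 1)]
pa = pyramid_histograms(a, vocab_size=2, width=100, height=100)
pb = pyramid_histograms(b, vocab_size=2, width=100, height=100)
```

At level 0 the two images are indistinguishable, since both contain one instance of each word. The finer levels lower the score for image b because its second feature lies in a different subregion.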



3.2. Feature Descriptors. In addition to determining where and to what extent a feature 
exists in an image, there is a separate body of research to determine how to represent the 
neighborhood of pixels near a localized region, called the feature descriptor. The simplest 
approach is to use the pixel intensity values, scaled for the size of the region, or 
an eigenspace representation thereof. Normalized pixel representations, however, have 
performed worse than many more sophisticated representations (see Fei and Perona (2005); 
Nowak et al. (2006), among others) and have largely been abandoned by the BoF research 
community. 

The most popular feature descriptor in the BoF literature is the SIFT (Scale Invariant 
Feature Transform) descriptor (Lowe, 2004). In brief, the 128 dimensional SIFT descriptor 
is a histogram of responses to oriented gradient filters. The responses to 8 gradient 
orientations at each of 16 cells of a 4x4 grid generate the 128 components of the vector. 



INTRODUCTION TO THE BAG OF FEATURES PARADIGM FOR IMAGE CLASSIFICATION AND RETRIEVAL 



The histograms in each cell are block-wise normalized. At scale 1, the cells are often 3x3 
pixels. 
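
The 4 x 4 x 8 = 128 layout can be illustrated with a deliberately simplified descriptor. The sketch below is not SIFT: it omits Gaussian weighting, interpolation between bins, and rotation normalization. It also assumes a 16x16 patch with 4x4-pixel cells and applies a single global L2 normalization in place of the block-wise scheme. It shows only how per-cell orientation histograms are concatenated into one vector.

```python
import math

def sift_like_descriptor(patch, grid=4, bins=8):
    """Concatenated gradient-orientation histograms over a grid x grid cell layout."""
    n = len(patch)            # assumes a square patch with n divisible by grid
    cell = n // grid
    desc = [0.0] * (grid * grid * bins)
    for y in range(n):
        for x in range(n):
            # Central differences with clamped borders.
            gx = (patch[y][min(x + 1, n - 1)] - patch[y][max(x - 1, 0)]) / 2
            gy = (patch[min(y + 1, n - 1)][x] - patch[max(y - 1, 0)][x]) / 2
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = min(int(angle * bins / (2 * math.pi)), bins - 1)
            c = (y // cell) * grid + (x // cell)
            desc[c * bins + b] += mag   # magnitude-weighted vote
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc] if norm else desc

# Horizontal intensity ramp: every gradient points along +x (orientation bin 0).
patch = [[float(x) for x in range(16)] for _ in range(16)]
d = sift_like_descriptor(patch)
```

For the ramp patch, only the first orientation bin of each of the 16 cells is nonzero, which makes the cell-major ordering of the 128 components easy to see.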

An alternative to the SIFT descriptor that has gained increasing popularity is SURF 
(Speeded Up Robust Features) (Bay et al., 2006). The SURF algorithm consists of both 
feature detection and representation aspects. It is designed to produce features akin to 
those produced by a SIFT descriptor on Hessian-Laplace interest points, but using efficient 
approximations. Reported results indicate that SURF provides a significant speed-up while 
matching or improving performance. 

Other descriptors which have been proposed include Gabor filter banks, image moments, 
and others. A study by Mikolajczyk and Schmid compares several feature descriptors, 
and shows that SIFT-like descriptors tend to outperform the others in many situations 
(Mikolajczyk and Schmid, 2005). The descriptors that were evaluated, however, lack 
color information. This is in contrast to the biomimetic vision community which typically 
includes a color-opponency aspect to feature representations. There is evidence that including 
color information in feature detection and description may improve BoF image 
retrieval performance (Jiang et al., 2007). A recent paper by van de Sande et al. presents 
an evaluation of color feature descriptors. Reported results indicate a combination of color 
descriptors outperforms SIFT on an image classification task and that, of the color descriptors, 
OpponentSIFT is most generally useful (van de Sande et al., 2010). 

Finally, there has been investigation on learning discriminative features for a given task 
or data set, instead of using an a-priori selected representation. Efforts include a method 
for unsupervised learning of discriminative feature subsets and their detection parameters 
(Karlinsky et al., 2009), and a modular decomposition of feature descriptors and a method 
for learning the best composition (Winder and Brown, 2007; Winder et al., 2009). Winder 
demonstrates that many common feature descriptors, such as SIFT, can be generated this 
way, yet there are others that can be learned that perform better under certain measures. 



4. Quantization and Distance Measures 

Vector Quantization (Clustering) is used to build the visual vocabulary in Bag of Fea- 
tures algorithms. Nearest-neighbor assignments are used not only in the clustering of fea- 
tures but also in the comparison of term vectors for similarity ranking or classification. 
Thus, it is important to understand how quantization issues, and the related issues involv- 
ing measuring distances in feature and term vector space, affect Bag of Features based 
applications. 



There are a great many clustering/vector quantization algorithms, and this report does 
not attempt to enumerate them. Many BoF implementations are described as using K-means 
(Sivic and Zisserman, 2003; Lazebnik et al., 2006; Jiang et al., 2007), or an approximation 
thereof for large vocabularies (Nister and Stewenius, 2006; Philbin et al., 2007). 



Given any clustering method, there will be points that are equally close to more than one 
centroid. These points lie near a Voronoi boundary between clusters and create ambiguity 
when assigning features to terms. With K-means and similar clustering methods, the choice 
of initial centroid positions affects the resultant vocabulary. When dealing with relatively 
small vocabularies, one can run K-means multiple times and select the best performing 
vocabulary during a validation step. This becomes impractical for very large data sets. 
When determining the distance between two features, as required by clustering and term 
assignment, common choices are the Manhattan (L1), Euclidean (L2), or Mahalanobis distances. 
A distance measure is also needed in term vector space for measuring the similarity 
between two images for classification or retrieval applications. Euclidean and Manhattan 
distances over sparse term vectors can be computed efficiently using inverted indexes (see 
Section 6.3.1), and are thus popular choices. However, the relative importance of some visual 
words leads to the desire to weight term vectors during the distance computation. The 
following sub-sections provide more details on these and other issues. 
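
As a small illustration of why sparsity helps, the sketch below computes Manhattan and Euclidean distances between term vectors stored as dictionaries of nonzero entries. The dictionary representation and the example term identifiers are assumptions; the cost depends only on the number of nonzero terms, not on the vocabulary size.

```python
import math

def l1_distance(u, v):
    """Manhattan distance between sparse vectors stored as {term_id: weight}."""
    return sum(abs(u.get(t, 0.0) - v.get(t, 0.0)) for t in set(u) | set(v))

def l2_distance(u, v):
    """Euclidean distance between sparse vectors stored as {term_id: weight}."""
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2
                         for t in set(u) | set(v)))

# Two hypothetical term vectors sharing one visual word (term 3).
a = {3: 0.5, 17: 0.5}
b = {3: 0.5, 42: 0.5}
d1 = l1_distance(a, b)
d2 = l2_distance(a, b)
```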

4.1. Term Weights. One of the earliest strategies for handling quantization issues at a 
gross level is to assign weights to the terms in the term vector. This can be viewed as a 
mitigation strategy for quantization issues that occur when the descriptors are distributed in 
such a way that simple clustering mechanisms over-represent some descriptors and under-represent 
others. With term weights, one can penalize terms found to be too common to be 
discriminative and emphasize those that are more unique. This is the motivation behind the 
popular Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme used 
in text retrieval (Salton and McGill, 1983). 

TF-IDF is defined as $tf_i \cdot \log(N / N_i)$, where $tf_i$ is the term frequency of the i'th word, 
$N$ is the number of documents (images) in the database, and $N_i$ is the number of documents 
in the database containing the i'th word. The log term is called the inverse document 
frequency and it serves to penalize the weights of common terms. 

Term vectors can also be represented as binary strings. A 1 is assigned for any term that 
appears in the image, 0 otherwise. This might humorously be called "anti term weighting," 
as it has the effect of making the terms have equal weight no matter how often they occur 
in a particular image or in the corpus as a whole. Distribution issues are thus handled by 
simply erasing the frequency information from the term vector so a few oversampled terms 
cannot dominate the distance computations between term vectors. Binary representations 
have the benefit of speed and compactness. 

BoF image retrieval implementations typically use TF-IDF weights, due to evidence 
that this method is superior to binary and term frequency representations (Jiang et al., 2007; 
Sivic and Zisserman, 2003). When using very large vocabularies, the term vectors become 
extremely sparse, and the term counts (prior to any normalization) are mostly zeros or 
ones. In this case, binary representations tend to perform similarly to term frequencies 
(Jiang et al., 2007). 
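
The TF-IDF and binary weighting schemes of this subsection can be sketched as follows. The term frequency is taken here to be the count normalized by the number of quantized features in the image, which is one common convention and an assumption on our part; the counts and document frequencies are made-up values.

```python
import math

def tf_idf(counts, doc_freq, num_docs):
    """counts: {term: occurrences in this image}; doc_freq: {term: N_i}."""
    total = sum(counts.values())
    # tf_i * log(N / N_i), with tf_i assumed to be the normalized count.
    return {t: (c / total) * math.log(num_docs / doc_freq[t])
            for t, c in counts.items()}

def binary(counts):
    """Binary term vector: 1 for every term present, regardless of frequency."""
    return {t: 1 for t, c in counts.items() if c > 0}

# Hypothetical image: term 0 occurs in every database image, term 1 in only one.
counts = {0: 3, 1: 1}
doc_freq = {0: 100, 1: 1}
w = tf_idf(counts, doc_freq, num_docs=100)
bvec = binary(counts)
```

Term 0 appears in every image, so its inverse document frequency is log(1) = 0 and it contributes nothing despite being the most frequent term in this image. The binary vector treats both terms identically.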



4.2. Soft Assignment. A given feature may be nearly the same distance from two cluster 
centers, but with a typical "hard assignment" method, only the slightly nearer neighbor is 
selected to represent that feature in the term vector. Thus, the ambiguous features that lie 
near Voronoi boundaries are not well-represented by the visual vocabulary. To address this 
problem, researchers have explored multiple assignments and soft weighting strategies. 

Multiple assignment is where a single feature is matched to k nearest terms in the vocab- 
ulary. Soft weights are similar, but the k nearest terms are multiplied by a scaling function 
such that the nearest term gets more weight than the k'th nearest term. These strategies are 
designed to mitigate the negative impact when a large number of features in an image sit 
near a Voronoi boundary of two or more clusters. 

Jegou et al. show that multiple assignment causes a modest increase in retrieval accuracy 
(Jegou et al., 2007). The cost of the improved accuracy is higher search time, due 
in part to the impact on term vector sparsity. The authors report that a k = 3 multiple 
assignment implementation requires seven times the number of multiplications of simple 
assignment. 

Soft weights have been explored by Jiang et al. (2007) and Philbin et al. (2008). In Jiang's 
work, the soft weights are computed as shown below. The computed weight for term n in 
the term vector w, denoted $w_n$, is defined as: 

(1) $w_n = \sum_{i=1}^{k} \sum_{j=1}^{M_i} \frac{1}{2^{i-1}} \, sim(j,n)$ 

where k is the number of neighbors to use in the soft weighting strategy, $M_i$ is the number of 
features in the image whose i'th nearest neighbor is term n, and sim(j,n) is the similarity 
measure between feature j and term n. Jiang et al. suggest that k=4 works well. In 
experimental evaluations, over a variety of vocabulary sizes, Jiang's soft weighting strategy 
bests binary, term frequency, and TF-IDF schemes, with one marginal exception. 
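
A rank-discounted soft weighting of this kind can be sketched as below. Two points are assumptions rather than facts from the text above: the contribution of a feature to its i'th nearest term is halved at each rank (a factor of 1/2^(i-1)), and the similarity function is a hypothetical choice, since sim(j,n) is not specified here.

```python
import math

def similarity(feature, center):
    # Hypothetical similarity measure; the text does not define sim(j, n).
    return 1.0 / (1.0 + math.dist(feature, center))

def soft_weights(features, vocabulary, k=4):
    """Accumulate rank-discounted similarities to each feature's k nearest terms."""
    weights = [0.0] * len(vocabulary)
    for f in features:
        ranked = sorted(range(len(vocabulary)),
                        key=lambda n: math.dist(f, vocabulary[n]))
        for i, n in enumerate(ranked[:k]):      # i = 0 is the nearest term
            weights[n] += similarity(f, vocabulary[n]) / 2 ** i
    return weights

# One feature sitting exactly on term 0, with term 1 one unit away.
vocabulary = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
features = [[0.0, 0.0]]
sw = soft_weights(features, vocabulary, k=2)
```

With k = 2 the third term receives no weight, so the resulting vector remains sparser than it would under a fully continuous weighting.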

Philbin et al. propose an approach that scales the term weight according to the distance 
from the feature to the cluster center (Philbin et al., 2008). The weights are assigned proportionally 
to a Gaussian decay on the distance, $\exp(-\frac{d^2}{2\sigma^2})$, where d is the distance to the 
cluster center, and σ is the spatial scale, selected such that a relatively small number of 
neighbors will be given significant weight. A problem with this continuous formulation is 
that all terms get a non-zero weight, so clipping very small values is prudent. Philbin et 
al. only compute the soft weights to a pre-determined number of nearest neighbors, which 
was three in their evaluations. Even with clipping, soft weights decrease the sparsity of the 
term vectors, and thus increase the index size and query retrieval times. After generating 
term vectors using the soft weighting strategy, Philbin et al. perform an L1 normalization 
so that the resulting vector looks like a term frequency vector. TF-IDF is then applied, ignoring 
soft assignment issues - i.e., the IDF is computed as if the input vector were created 
by the normal hard-assignment process. Evaluation results show a strong improvement in 
query accuracy. 
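
A minimal sketch of Gaussian soft assignment, assuming weights proportional to exp(-d^2 / (2σ^2)), restriction to the r nearest cluster centers, and a final L1 normalization of the whole term vector. The one-dimensional vocabulary and the value of σ are made up for illustration.

```python
import math

def gaussian_soft_assign(features, vocabulary, sigma, r=3):
    """Sparse term vector with Gaussian-decayed weights on the r nearest terms."""
    vec = {}
    for f in features:
        dists = sorted((math.dist(f, c), n) for n, c in enumerate(vocabulary))
        for d, n in dists[:r]:                  # clip to the r nearest terms
            vec[n] = vec.get(n, 0.0) + math.exp(-d * d / (2 * sigma * sigma))
    total = sum(vec.values())
    # L1 normalization so the result resembles a term frequency vector.
    return {n: w / total for n, w in vec.items()} if total else vec

# A feature exactly halfway between terms 0 and 1, far from terms 2 and 3.
vocabulary = [[0.0], [2.0], [10.0], [20.0]]
v = gaussian_soft_assign([[1.0]], vocabulary, sigma=1.0, r=3)
```

The borderline feature contributes equally to both nearby terms, whereas hard assignment would pick one of them arbitrarily. The third neighbor receives a negligible weight that a clipping threshold would remove.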

This result is consistent with the earlier observations by Jegou et al. and Jiang et al. that 
multiple assignment/soft weighting improves retrieval accuracy by mitigating some of the 
quantization errors for borderline features. In the body of literature surveyed by this report, 
no direct comparison between these three methods has been performed. 

4.3. Non-uniform Distributions. Jurie and Triggs show that the distribution of cluster 
centers is highly non-uniform for dense sampling approaches. When using k-means clus- 
tering, high-frequency terms dominate the quantization process, yet these common terms 
are less discriminative than the medium-frequency terms that can get lost in quantization. 
Instead, they propose an online, fixed-radius clustering method that they demonstrate
produces better codebooks (Jurie and Triggs, 2005). Jegou et al. discuss the "burstiness"
of visual elements, meaning that a visual word is more likely to appear in an image if it
has appeared once before. Thus visual words are not independent samples of the image.
Various weighting functions and strategies are proposed and evaluated (Jegou et al., 2009a).



Similar to distribution issues within the feature space used to construct the vocabulary, 
term vectors can be non-uniformly distributed in the gallery set. Weighting strategies can 
be used to compensate for non-uniform term vector distributions, as discussed above, or a 
distance measure can be created that scales with local distributions in an attempt to regu- 
larize the space. 

The latter approach is explored by the Contextual Dissimilarity Measure (CDM) (Jegou
et al., 2007). Jegou et al. point out the non-symmetry in a k-nearest-neighbors computation,
which is central both to the vocabulary generation in the feature space and also to
computing similarity scores over the term vector space. Consider that some point x may be
a neighbor of y, but the converse may not be true, as seen in Figure 2. They call this effect
"neighborhood nonreversibility," and implicate it as part of a problem in BoF-based image 
retrieval that causes some images to be selected too often in query results while others are
never selected at all. At a high level, CDM regularizes the term vector space by penalizing 
the distance measure for points in a local neighborhood that cause nonreversibility. 




Figure 2. Illustration showing nearest neighbors "nonreversibility." 
Point Y is a 3-nearest neighbor of X, but the converse is not true. The 
Contextual Dissimilarity Measure attempts to address this issue. 
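
The asymmetry in Figure 2 is easy to reproduce numerically. The sketch below uses a hypothetical one-dimensional point layout (not taken from the paper) in which an isolated point x counts a clustered point y among its 3 nearest neighbors, while y does not count x among its own.

```python
def k_nearest(points, i, k):
    """Indices of the k points nearest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: abs(points[j] - points[i]))
    return others[:k]

# Hypothetical layout: x sits alone, y sits at the edge of a tight cluster.
points = [0.0, 4.0, 4.5, 5.0, 5.5, 6.0]
x, y = 0, 1

y_in_knn_of_x = y in k_nearest(points, x, 3)  # y is a 3-nearest neighbor of x
x_in_knn_of_y = x in k_nearest(points, y, 3)  # but x is not one of y's
```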



A discussion of the details of CDM is beyond the scope of this survey, but Jegou et 
al. show the CDM computation results in a single scalar distance update term for each 
term vector in the gallery. Computation of the exact CDM values, performed offline while 
indexing the images, is of quadratic complexity in the number of images. Fortunately, the 
authors show that approximate methods work well, so the quadratic complexity is mitigated
in practice. The CDM-adjusted distance between a query term vector and a gallery term
vector is simply the original distance multiplied by the CDM update term for that gallery
image. Any distance function appropriate to the term vector space can be used. Given a
distance function represented as $d(q, v_j)$, with $q$ as the query term vector, $v_j$ as the term
vector for the j'th image in the gallery, and $\delta_j$ as the distance update term for the j'th term
vector, the Contextual Dissimilarity Measure is simply:

(2)    $\mathrm{CDM}(q, v_j) = \delta_j \cdot d(q, v_j)$

Evaluation results of applying the CDM regularization to an otherwise standard BoF
image retrieval implementation show significantly improved accuracy over contemporary
methods. As a side note, their evaluations also show that the Manhattan (L1) distance
works better than the Euclidean (L2), whether or not CDM is applied.
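
At query time, the CDM of Equation 2 amounts to one multiplication per gallery image. The following sketch assumes the per-image update terms have already been computed offline; the update values, term vectors, and function names are hypothetical and chosen only to show that the scaling can change the ranking.

```python
def l1_distance(u, v):
    """Manhattan (L1) distance between two term vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cdm_rank(query, gallery, deltas, dist=l1_distance):
    """Rank gallery indices by delta_j * d(q, v_j), smallest first."""
    scored = [(deltas[j] * dist(query, v), j) for j, v in enumerate(gallery)]
    return [j for _, j in sorted(scored)]

query = [0.5, 0.5, 0.0]
gallery = [[0.6, 0.4, 0.0], [0.4, 0.4, 0.2], [0.0, 0.0, 1.0]]

plain_ranking = cdm_rank(query, gallery, deltas=[1.0, 1.0, 1.0])
# Hypothetical update terms: image 0 is penalized for being selected too often.
cdm_ranking = cdm_rank(query, gallery, deltas=[2.5, 1.0, 1.0])
```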

4.4. Intracluster Distances. There is a trade-off involved in choosing the granularity of 
the quantization, where finer-grained clustering leads to potentially more discriminative 
information being preserved at the cost of increased storage and computational requirements.
A possible compromise is to use a coarser-grained quantization, but compensate
for the lack of discrimination by employing an efficient method for incorporating intra- 
cluster distances to weight terms. To this end, Jegou et al. developed a technique called 
Hamming Embedding, which efficiently represents how far a feature lies from the cluster 
center and thus how much weight to assign to that term for the detected feature (Jegou et al.,
2008).

5. Image Classification using Bag of Features

In the preceding sections, we have discussed BoF image representations somewhat in- 
dependently of the task. In this section and the next, we explore the two most common 
BoF applications in the literature, image classification and image retrieval. We start with 
image classification. 

5.1. Problem Definition. We use the term image classification to describe any supervised 
classification of images as a whole, as opposed to classification based more directly on the 
specific contents (objects) present in the image. BoF based image classification is the 
process of representing a training set of images as term vectors and training a classifier 
over this representation. A probe image can be encoded with the same dictionary and 
given to the classifier to be assigned a label. 
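
The train-then-probe pipeline can be sketched end to end on toy data. This is a minimal illustration, not any surveyed system: features are hard-assigned to a hypothetical two-word dictionary, and a nearest-centroid rule stands in for the classifier (the surveyed papers mostly use SVMs or Naive Bayes). All names and data are assumptions.

```python
def encode(descriptors, dictionary):
    """Hard-assign each descriptor to its nearest visual word and return
    a term-frequency histogram normalized to sum to 1."""
    hist = [0.0] * len(dictionary)
    for d in descriptors:
        nearest = min(
            range(len(dictionary)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(d, dictionary[i])),
        )
        hist[nearest] += 1.0
    n = sum(hist)
    return [h / n for h in hist] if n else hist

def train_centroids(term_vectors, labels):
    """Stand-in classifier: one mean term vector per class label."""
    sums, counts = {}, {}
    for v, y in zip(term_vectors, labels):
        acc = sums.setdefault(y, [0.0] * len(v))
        for i, x in enumerate(v):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def classify(term_vector, centroids):
    """Assign the label of the nearest class centroid."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(term_vector, centroids[y])),
    )

dictionary = [(0.0,), (1.0,)]  # hypothetical 1-D visual words
train_images = [
    [(0.1,), (0.0,), (0.9,)], [(0.2,), (0.1,)],   # class "a"
    [(0.9,), (1.1,), (0.2,)], [(1.0,), (0.8,)],   # class "b"
]
labels = ["a", "a", "b", "b"]

centroids = train_centroids([encode(im, dictionary) for im in train_images], labels)
probe = [(0.95,), (1.2,), (0.7,), (0.1,)]
probe_vector = encode(probe, dictionary)
predicted = classify(probe_vector, centroids)
```

The point of the sketch is that the probe image is encoded with the same dictionary that was used for the training set before it reaches the classifier.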

Similar to the image classification task, people discuss BoF-based object detection, ob- 
ject recognition, and scene classification. When the data set has essentially one object 
class per image, then the difference between image classification and object detection is
blurred. This is the case, for example, in the popular CalTech101 data (see Fei-Fei et al.,
2004). Due to the discarded spatial information, BoF representations are not well-suited
to object detection and localization, but there are examples in the literature of doing so in
an approximate fashion (e.g., Sivic et al., 2005). Scene classification is essentially
synonymous with image classification. That is, the entire image is classified with no attempt
to detect or localize specific objects it may contain. 

5.2. Related Methods. To help put the BoF image classification literature in context, we 
first investigate a set of related approaches. The first is a brief discussion of the similarity 
between BoF image classification and texture classification. As with Bag of Features, tex- 
ture classification methods commonly sample features from the image (a texture image), 
quantize those features, and build representative histograms, which are then used in a
classification task (see Leung and Malik (2001); Liu and Wang (2002), among many others).
In the texture recognition literature, "textons" refer to representative image patches and are
analogous to visual words in the Bag of Features paradigm. Zhang et al. performed a study
of local features and classification kernels for Support Vector Machine-based texture and
image classification (Zhang et al., 2005), which indirectly illustrates how similar Bag of
Features image classification is to the earlier body of work on texture representation and
classification. 

A second body of related work is object detection using a part-based model. Unlike the 
orderless Bag of Features, a part-based model (also called a "constellation") learns a de- 
formable arrangement of features that represent an object class. Similar to BoF, part-based 
models often start with feature detection and extraction stages. Diverging from a BoF
approach, part-based models typically employ a maximum likelihood estimation technique to
determine which of several previously learned models best explains the detected arrangement of
features. Part-based methods have shown strong robustness in the face of background clutter,
partial occlusions, and variances in the object's appearance (Burl et al., 1998; Crandall
and Huttenlocher, 2006; Fergus et al., 2003; Leordeanu et al., 2007; Sudderth et al., 2005). A
key weakness of part-based models is computational complexity. There is a combinatorial
search required to determine which subset of features matches a given model. To reduce 
computation time, often only a small number of features are extracted from the image (one 
or two orders of magnitude smaller than with BoF methods), thus making the algorithms 
more sensitive to feature detection errors.

Another related technique is the use of a gist descriptor for scene classification, such as
the Holistic Spatial Envelope (Oliva and Torralba, 2001). Gist-based approaches attempt to
create a low-dimensional signature of an image that is computationally cheap and contains 
enough information for gross categorization. Both gist and BoF representations attempt 
to reduce an image to a low dimensional representation, but while BoF is a histogram 
of quantized local features, gist is a single global descriptor of the entire image. Gist 
representations are thus smaller and faster to compute, but with a commensurate loss of 
discriminatory information. The Holistic Spatial Envelope is akin to using a single
512-dimensional SIFT descriptor to represent the entire image, as described in (Torralba et al.,
2008).

Visual localization is the problem of determining approximate location based only on 
visual information. It is a related task to image classification, because input images must 
be classified as being part of a known location. Researchers have been using SIFT features
as natural landmarks for many years (Se et al., 2002; Goedeme et al., 2005; Spexard
et al., 2006). Recently, researchers have applied BoF representations directly to the task
(Fraundorfer et al., 2007; Kang et al., 2009; Kazeka, 2010; Naroditsky et al., 2009).

Wrapping up our discussion of works relating to BoF image classification are biologically
inspired methods for scene classification. A well-known example is scene classification
using the Neuromorphic Vision Toolkit (Siagian and Itti, 2007). Siagian and Itti detect
local color, intensity, and gradient orientation features using non-maxima suppression over
a scale space, but in contrast to BoF representations, they impose a spatial structure by
dividing each image into a 4x4 grid, over which features are aggregated. More recently,
Song and Tao presented a "biologically inspired feature manifold" representation of an
image (Song and Tao, 2010). With the assumption that the features are sampled from a
low dimensional manifold embedded in a high dimensional space, the authors propose a
manifold-based projection to reduce dimensionality while preserving discriminative
information. They show SVM-based scene classification on this representation yields much
faster and more accurate results than that of Siagian and Itti (Song and Tao, 2010).



5.3. The BoF Approach. This section provides an overview of key papers on BoF image 
classification, in roughly chronological order to show how the technique has evolved over 
the past several years. 

Csurka et al. present an early application of the Bag of Features representation for image 
classification (Csurka et al., 2004). Their front-end process closely follows the one outlined
in Section 2. They employ the Harris-Affine keypoint detector and the 128-dimension SIFT
descriptor. They generate several vocabularies using k-means clustering, varying the value 
of k. They compare the use of Naive Bayes and linear Support Vector Machine (SVM) 
classifiers to learn a model which can predict the image class from its BoF representation. 
Experimental results show that SVM strongly outperforms Naive Bayes, and that larger 
vocabulary sizes (as measured by k, the number of cluster centers) perform better, within 
the tested range of 100-2500. To account for the random starting positions of k-means, ten 
vocabularies for each choice of k were tested, with the best results reported. 

Jurie and Triggs investigate BoF-based image classification, with an emphasis on sampling
strategies and clustering methods for creating vocabularies (Jurie and Triggs, 2005).
Features are extracted by densely sampling grayscale variance-normalized 11x11 patches 
over a multiscale grid. Vocabulary creation uses a novel Mean-Shift-like clustering method 
that is purported to better account for the highly non-uniform densities in feature space. 
Naive Bayes and linear SVM classification results are reported, using 10-fold cross validation,
showing SVM outperforming Naive Bayes in all tests. Codebooks were generated
using 2500 cluster centers. The authors compare their method (dense pyramid sampling
and novel clusterer) with two others: dense sampling with k-means clustering, and a Differ- 
ence of Gaussian (DoG) keypoint detector with k-means clustering. Their method performs 
best; the keypoint-based method performs worst. They conclude that sparse, keypoint- 
based representations fare poorly due to a loss of discriminative power, and that with dense 
sampling, k-means produces poor codebooks compared to one that enforces fixed-radius 
clusters. 

Zhang et al. perform a large-scale quantitative evaluation of BoF representations for 
both texture recognition and image classification (Zhang et al., 2005). This evaluation
looks at different feature detectors, different region descriptors, and SVM classifier ker- 
nels. While there was no single best method for all tests, the authors recommend a mix 
of detectors and features with complementary types of information. Further, the authors 
point out that using local features with the highest level of invariance does not yield the 
best results. 

Continuing along the same lines as their earlier work, Nowak, Jurie, and Triggs further
explore sampling strategies and other factors impacting BoF-based image classification
(Nowak et al., 2006). In contrast to their earlier work, this implementation uses the SIFT
descriptor, showing that it is superior to normalized pixel intensities. This work reinforces
the fact that random dense sampling, assuming a high-enough number of keypoints per
image, outperforms keypoint detectors for image classification with SVMs. They present
results on the 2005 PASCAL Visual Object Classification challenge (see Everingham et al.
(2006)), showing superior accuracy in all categories when compared to the best individual
results for each category. As such, it set a new high-water mark for image classification 
while using simpler methods than its contemporaries. 

Fei-Fei and Perona present a BoF image classification approach based on a generative
Bayesian hierarchical model (Fei and Perona, 2005). Their model includes latent topics,
or "themes," that are considered hidden variables. An algorithm based on Latent Dirichlet 
Allocation is applied to learn the model given weak supervision in the form of image class 
labels. While the accuracy of this approach may not compare to more recent SVM-based 
classification methods, it nevertheless is attractive for the ability to learn intermediate-level 
themes present in an image based on a Bag of Features representation. Their results show 
that the SIFT descriptor is superior to normalized 11x11 pixel intensities for their method. 
They also show that a dense grid-based sampling technique outperformed the two keypoint 
operators they evaluated, with the observation that the grid sampling technique generated 
the most feature points per image. 

Sivic et al. propose using Probabilistic Latent Semantic Analysis (pLSA) with a Bag
of Features representation for image classification and even object localization (Sivic et al.,
2005). Conceptually similar in some respects to Fei-Fei and Perona's contemporary work
(discussed above), Sivic et al. employ a probabilistic model that discovers hidden (latent) 
topics within the images based on the observed BoF representation. In this case, the ob- 
ject classes are considered the latent topics, such that an image containing a mixture of 
objects from different classes would be modeled as a mixture of latent topics. Feature 
detection is performed using both MSER and Harris-Affine keypoint operators. Features 
are represented by SIFT descriptors. As with the other techniques presented in this sec- 
tion, weak supervision is provided in the form of image labels. A notable contribution 
includes a method for localizing the objects (via rough segmentation) by selecting groups 
of maximally-likely features for a given topic. They also introduce visual word "doublets,"
learned from spatially co-occurring features. Results indicate that using doublets helps
increase the localization accuracy. 

Grauman and Darrell present the Pyramid Match Kernel for use with SVM-based image
classification for BoF representations (Grauman and Darrell, 2005). In the Pyramid Match
Kernel, each feature set is mapped to a multi-resolution histogram. Histogram pyramids
are matched with a weighted histogram intersection function, where coarser bins have less
weight than finer. This kernel approximates the optimal correspondence matching between 
sets of unequal cardinality, and is computationally faster than many comparable kernels. 
The Pyramid Match Kernel is a Mercer kernel (positive semi-definite), and thus SVM 
convergence is assured. 
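
A small sketch may help make the weighted intersection concrete. It assumes non-negative feature coordinates bounded by a known diameter, bin side lengths that double at each level, and a weight of 1/2^i on matches first found at level i; these choices follow the common description of the kernel, but the code is an unnormalized illustration with hypothetical names, not the authors' implementation.

```python
import math
from collections import Counter

def histogram(features, side):
    """Count features per bin for a grid with the given bin side length."""
    return Counter(tuple(int(x // side) for x in f) for f in features)

def intersection(h1, h2):
    """Histogram intersection: number of features matched at this resolution."""
    return sum(min(c, h2[b]) for b, c in h1.items())

def pyramid_match(X, Y, diameter):
    """Unnormalized pyramid match score between two feature sets."""
    levels = int(math.ceil(math.log2(diameter))) + 1
    score, prev = 0.0, 0
    for i in range(levels):
        side = 2 ** i
        cur = intersection(histogram(X, side), histogram(Y, side))
        # Only matches that are new at this level count, weighted by 1 / 2^i,
        # so coarser bins contribute less than finer ones.
        score += (cur - prev) / side
        prev = cur
    return score

# Hypothetical 1-D feature sets of unequal cardinality.
X = [(1,), (5,), (9,)]
Y = [(1,), (6,)]
score_xy = pyramid_match(X, Y, diameter=16)
score_xx = pyramid_match(X, X, diameter=16)
```

In this toy case one pair matches at the finest level (weight 1) and one more pair first matches when the bins have side 4 (weight 1/4); the third feature of X stays unmatched because Y has only two features.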

Inspired by the Pyramid Match Kernel, Lazebnik et al. introduced the Spatial Pyramid 
Matching technique for image classification (Lazebnik et al., 2006). The key difference
between the two approaches is that the Pyramid Match Kernel uses a multi-resolution his- 
togram while Spatial Pyramid Matching uses a fixed sampling resolution but changes the 
size of the regions over which the histograms are formed. The Spatial Pyramid Match ker- 
nel is represented as the sum of pyramid match kernels over the quantized vocabulary. The 
Spatial Pyramid Matching technique no longer represents an image as an orderless collec- 
tion of features, thus it is not a true Bag of Features approach. However, it does reduce to 
a BoF approach when only a single pyramid level is used. Grauman and Darrell's Pyramid 
Match Kernel approach, on the other hand, maintains the orderless feature representation, 
but does not use a visual vocabulary. Lazebnik et al.'s Spatial Pyramid Matching uses 
SIFT descriptors extracted from a dense grid with spacing of 8 pixels. K-means clustering
is used for constructing the vocabulary, which consists of either 200 or 400 clusters. 
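
To illustrate how the spatial pyramid reduces to a plain BoF histogram at a single level, the sketch below builds the concatenated, level-weighted histograms and compares them with histogram intersection. It assumes features given as (x, y, word) with coordinates in [0, 1) and the level weights usually quoted for this method (1/2^L for level 0 and 1/2^(L-l+1) for level l); the names and toy data are hypothetical.

```python
def spatial_pyramid(features, vocab_size, L=2):
    """Concatenate per-cell word histograms for levels 0..L, pre-weighted."""
    vec = []
    for level in range(L + 1):
        cells = 2 ** level  # the image is split into cells x cells regions
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        hist = [0.0] * (cells * cells * vocab_size)
        for x, y, word in features:
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            hist[(cy * cells + cx) * vocab_size + word] += weight
        vec.extend(hist)
    return vec

def intersection_kernel(u, v):
    return sum(min(a, b) for a, b in zip(u, v))

# Two hypothetical images with the same words in swapped positions.
A = [(0.1, 0.1, 0), (0.9, 0.9, 1)]
B = [(0.9, 0.9, 0), (0.1, 0.1, 1)]

bof_a = spatial_pyramid(A, vocab_size=2, L=0)  # single level: plain BoF
bof_b = spatial_pyramid(B, vocab_size=2, L=0)
k_ab = intersection_kernel(spatial_pyramid(A, 2), spatial_pyramid(B, 2))
k_aa = intersection_kernel(spatial_pyramid(A, 2), spatial_pyramid(A, 2))
```

The two images are indistinguishable as orderless bags, but the pyramid kernel scores them well below a self-match because the words fall in different regions.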

In addition to presenting the Spatial Pyramid Matching kernel, Lazebnik et al. provide 
evidence that latent factor analysis, such as that in (Fei and Perona, 2005; Sivic et al., 2005),
hinders classification accuracy. The comparison is done using a single-level pyramid that
is equivalent to a standard BoF representation with a 200-term vocabulary, using the same
data set and protocol as (Fei and Perona, 2005). They outperform Fei-Fei and Perona's
method by about 7%. When Lazebnik et al. apply pLSA to their model, their results drop
back to be comparable with Fei-Fei and Perona.

Jiang, Ngo, and Yang evaluate several factors that impact BoF image classification using
SVMs (Jiang et al., 2007), most notably the choice of feature detector, the term vector
weighting scheme, and the SVM kernel. SIFT descriptors are used in all evaluations.
Using term frequency, 1,000 visual words, and a $\chi^2$ RBF kernel, Jiang et al. show that
the DoG interest point operator outperforms other popular methods on the PASCAL-2005
data set. Note that the experiment controlled the average number of keypoints per image
to between 750 and 925 for all detectors. Jiang et al. compare term weighting strategies
as well, evaluating binary, term frequency, TF-IDF, and soft weighting. For all but the
smallest vocabulary size used in the evaluation on the PASCAL-2005 data, soft weighting
proved superior. SVM kernels that were evaluated by the authors include Linear, Histogram
Intersection (HI), Gaussian RBF, Laplacian RBF, Sub-linear RBF, and $\chi^2$ RBF. On
the PASCAL-2005 data set, the best mean equal error rates occurred for the latter three
of the six kernels. The authors subsequently recommend the $\chi^2$ RBF and Laplacian RBF
kernels. Combining their recommended selections of DoG feature detector, soft weighting
scheme, and $\chi^2$ RBF kernel, they show results comparable to the then-current state-of-the-art
performance on the PASCAL-2005 data set. They note that their performance is close
to that of Nowak et al. (Nowak et al., 2006) (the best reported result at the time), but that
they use far fewer features per image compared to Nowak's dense sampling strategy.
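
For reference, three of the histogram kernels named above can be written compactly. These are common textbook forms, given here as a sketch: the exact scaling of the chi-square distance and the choice of gamma vary between papers, so both should be read as assumptions rather than the settings used by Jiang et al.

```python
import math

def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def chi2_rbf(x, y, gamma=1.0):
    """Chi-square RBF kernel: exp(-gamma * chi2(x, y))."""
    return math.exp(-gamma * chi2_distance(x, y))

def laplacian_rbf(x, y, gamma=1.0):
    """Laplacian RBF kernel: exp(-gamma * L1 distance)."""
    return math.exp(-gamma * sum(abs(a - b) for a, b in zip(x, y)))

def histogram_intersection(x, y):
    """Histogram intersection (HI) kernel."""
    return sum(min(a, b) for a, b in zip(x, y))

# Hypothetical L1-normalized term vectors.
u = [0.5, 0.25, 0.25, 0.0]
v = [0.25, 0.25, 0.0, 0.5]
```

Each kernel returns its maximum for identical inputs and decreases as the histograms diverge, which is the behavior an SVM relies on when these are used as similarity functions.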

6. Image Retrieval using Bag of Features

6.1. Problem Definition. Another popular application of the Bag of Features image rep- 
resentation is image retrieval. We define image retrieval as the task of finding the most sim- 
ilar images in a gallery to a given query image. The gallery is a database of images, often 
from video. We differentiate this from Content-Based Image Retrieval (CBIR) approaches 
that try to find those images which contain a set of specific objects or other concepts, like 
"sky." In image retrieval as we define it, the process is query-by-example using an entire 
image or image subregion as the example. A Bag of Features based image retrieval algo- 
rithm returns results based on a similarity score (distance) between the query image term 
vector and the term vectors of the gallery images, ranked accordingly. Note that the BoF 
image retrieval approach requires no supervised training. This is an important strength 
of the approach, allowing the efficient indexing of large image sets with no ground truth 
training data. 

In contrast, a common CBIR approach is to create a large number of specific ob- 
ject/concept detectors and to index the gallery set based on the detection results for each. A 
query can be specified as various combinations of these objects/concepts, perhaps weigh- 
ing some objects as being more important to the user than others, and results returned 
based on the previously indexed detection results. The disadvantage of this approach is 
the necessity of creating a seemingly endless number of object detectors, each requiring 
ground truth labels, parameter selection, validation, and so on. 

The similarity-based image retrieval approach requires the user to present one or more
sample images of what they are searching for. The flexibility and efficiency of BoF image
retrieval comes with the trade-off that it places more burden on the user to generate a query
and lacks the semantics required to support text-based queries.



6.2. Related Methods. A recent survey of CBIR algorithms was conducted by Datta et al.
(2008). Datta et al. use the term CBIR more generally than we have defined it here. Their survey
includes a wide variety of image/video query technologies. As our focus is the BoF
representation, we explore this image retrieval method in more detail. We refer the
interested reader to Datta's survey for a comprehensive look at CBIR and the similarities 
between various approaches. 

Gist descriptors have also been applied for image retrieval. Gist was discussed in Section
5.2 in the context of image classification, but just as the BoF representation has been
applied to multiple applications, so has gist. Torralba et al. developed compact gist-based
representations used to retrieve results from a gallery of 12 million icon-sized images
(Torralba et al., 2008). In a variant of the image retrieval task applied for image/video copy
detection, Douze et al. compared the relative utility of gist and BoF representations, with
the expected trade-off of speed vs. accuracy (Douze et al., 2009).



6.3. The BoF Approach. A seminal work defining the Bag of Features image retrieval
approach is the "Video Google" paper of Sivic and Zisserman (2003). Video Google uses
both MSER and Harris-Affine keypoint detectors to detect features, which are represented
by SIFT descriptors. The vocabulary is built using k-means clustering. Nearest Neighbor 
term assignment and the Euclidean distance on TF-IDF weighted term vectors are used 
for similarity scoring. Additionally, Video Google employs a "spatial consistency" step 
that attempts to validate the search results by aligning subsets of interest points with an 
approximate affine transformation. 
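
The TF-IDF scoring step can be sketched as follows. The weighting used is the standard text-retrieval form (term frequency times the log of inverse document frequency), and ranking uses the Euclidean distance named in the text; the word counts and function names are hypothetical, and the sketch omits detection, description, and spatial verification entirely.

```python
import math

def idf_weights(gallery_counts):
    """log(N / n_i) per word, where n_i is the number of images containing word i."""
    n_images = len(gallery_counts)
    vocab = len(gallery_counts[0])
    doc_freq = [sum(1 for c in gallery_counts if c[i] > 0) for i in range(vocab)]
    return [math.log(n_images / df) if df else 0.0 for df in doc_freq]

def tfidf(counts, idf):
    """Term frequency (count / total) scaled by the inverse document frequency."""
    total = sum(counts)
    return [(c / total) * w if total else 0.0 for c, w in zip(counts, idf)]

def rank(query_counts, gallery_counts):
    """Gallery indices ordered by Euclidean distance between TF-IDF vectors."""
    idf = idf_weights(gallery_counts)
    q = tfidf(query_counts, idf)
    dists = [math.dist(q, tfidf(c, idf)) for c in gallery_counts]
    return sorted(range(len(gallery_counts)), key=lambda j: dists[j])

# Hypothetical visual word counts for three gallery images and one query.
gallery_counts = [[3, 1, 0], [0, 2, 2], [1, 1, 4]]
idf = idf_weights(gallery_counts)
ranking = rank([0, 1, 1], gallery_counts)
```

Note that the second word appears in every gallery image, so its IDF weight is zero and it contributes nothing to the distance.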

While Video Google demonstrated the effectiveness of the BoF representation for image
retrieval, it was far from being "Google-like" in scale. Since then, researchers have
extended this basic approach to (1) deal with much larger scale data sets, (2) improve the
initial query results, and (3) improve results ranking using post-query rank adjustment. Of
these three major areas, the improvements to initial query results have stemmed from improving
feature detection and representation (see Section 3) and addressing quantization
issues (see Section 4). Improvements relating to (1) and (3) are discussed below. Following
that, we discuss the challenge of trying to generate a general visual vocabulary which
could be used to index any image set. 

6.3.1. Scalability. To perform Internet-scale image retrieval, Bag of Features methods
require efficient indexing strategies and the ability to handle large vocabularies. In Video
Google, the number of images in the database was approximately 4,000, and the visual
vocabulary was on the order of 10,000 terms. An inverted file system structure was used
to make queries efficient, resulting in query response time of about 0.1 seconds over 4,000
images. 

Essentially, an inverted file system (or inverted index) method for indexing the gallery 
term vectors is one where each term records the images in which it appears, along with the 
term frequency for each image. The motivation for the inverted index is to take advantage 
of the sparseness of the term vectors. If each image yields approximately 2,000 interest
points and the vocabulary size is 10K terms, then the resulting term vector (assuming no
multiple/soft assignments) will have at most 20% of its entries as non-zero. In a larger
system with a vocabulary of 100K to 1M terms, the term vectors will be even sparser.
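
The structure and the sparsity bound can both be shown in a few lines. In this sketch the gallery is assumed to be stored as sparse per-image dictionaries of term frequencies; the image identifiers, term numbers, and function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(gallery):
    """gallery: {image_id: {term: frequency}}.
    Returns {term: [(image_id, frequency), ...]}."""
    index = defaultdict(list)
    for image_id, terms in gallery.items():
        for term, freq in terms.items():
            index[term].append((image_id, freq))
    return index

def max_nonzero_fraction(features_per_image, vocab_size):
    """Upper bound on term vector density under hard assignment:
    each feature can touch at most one term."""
    return min(features_per_image, vocab_size) / vocab_size

gallery = {"a": {7: 2, 12: 1}, "b": {12: 4}, "c": {7: 1, 30: 3}}
index = build_inverted_index(gallery)

bound_10k = max_nonzero_fraction(2000, 10_000)    # the 20% figure from the text
bound_1m = max_nonzero_fraction(2000, 1_000_000)  # far sparser with 1M terms
```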

A simple approach to ranking query results is to compute the distance between the query
vector and the term vectors of each gallery image, requiring an O(N) computation in terms
of the number of gallery images. The majority of those images are likely to have very few
terms in common due to the sparsity of the vectors, however. Instead, one can look at each
non-zero term in the query vector and get the list of images in which that term appears via
the inverted index. Compiling this list for all non-zero terms will result in a subset of the
gallery that contains partial matches. The normalized Lp distance to each image discovered
in the inverted index can be computed incrementally as the inverted index is traversed, such
as demonstrated in (Nister and Stewenius, 2006). While the worst case complexity remains
O(N), there is a huge speed improvement for the average case. Inverted indexes are used
in all contemporary BoF-based image retrieval methods that were surveyed in this report. 
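
One way to see how the distance can be accumulated during traversal is the L1 case. For non-negative, L1-normalized vectors q and d, the distance equals 2 plus the sum of (|q_i - d_i| - q_i - d_i) over the terms where both are non-zero, so only shared terms need to be visited. The sketch below applies that identity; the data and names are hypothetical, and it is an illustration of the idea rather than the scoring code of any surveyed system.

```python
def l1_normalize(sparse):
    total = sum(sparse.values())
    return {t: v / total for t, v in sparse.items()}

def build_index(gallery):
    """{term: [(image_id, normalized_weight), ...]} from sparse term vectors."""
    index = {}
    for image_id, terms in gallery.items():
        for term, w in l1_normalize(terms).items():
            index.setdefault(term, []).append((image_id, w))
    return index

def query(index, q):
    """Return [(image_id, L1 distance)] for images sharing a term with q."""
    q = l1_normalize(q)
    dist = {}
    for term, qv in q.items():
        for image_id, dv in index.get(term, []):
            # Start from 2 (disjoint vectors) and correct for each shared term.
            dist[image_id] = dist.get(image_id, 2.0) + abs(qv - dv) - qv - dv
    return sorted(dist.items(), key=lambda kv: kv[1])

gallery = {"a": {1: 2, 5: 1}, "b": {5: 3, 9: 1}, "c": {2: 4}}
index = build_index(gallery)
results = query(index, {5: 1, 9: 1})
```

Image "c" shares no terms with the query, so it is never touched; its distance is implicitly the maximum value of 2.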

Additionally, the use of a stop-words list, a technique from text retrieval that eliminates 
the most common and least discriminative words ("a," "the," etc.), can also be used in im- 
age retrieval. In this case, one simply eliminates any term from the vocabulary that appears 
in too many images. Stop-words were noted as being helpful in (Sivic and Zisserman,
2003; Nister and Stewenius, 2006), and clearly are helpful in avoiding the traversal of a
long list of documents in an inverted file system.
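
Applied to an inverted index, the stop-word rule is a simple filter on posting-list length. The threshold below is a hypothetical parameter; the surveyed papers choose their own cut-offs.

```python
def stop_words(index, n_images, max_fraction):
    """Terms whose posting list covers more than max_fraction of the gallery."""
    return {t for t, postings in index.items()
            if len(postings) / n_images > max_fraction}

def remove_stop_words(index, n_images, max_fraction=0.5):
    """Return a copy of the index without the overly common terms."""
    stops = stop_words(index, n_images, max_fraction)
    return {t: p for t, p in index.items() if t not in stops}

# Hypothetical index over four images: term 3 appears in all of them.
index = {
    3: [("a", 1), ("b", 2), ("c", 1), ("d", 5)],
    8: [("a", 2)],
    9: [("b", 1), ("d", 1)],
}
pruned = remove_stop_words(index, n_images=4, max_fraction=0.5)
```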

However, even with the use of an inverted index and stop-words, more needs to be 
done to achieve acceptable performance in very large data sets. Nister and Stewenius 
made a scalability breakthrough by applying a hierarchical clustering method to represent 
the vocabulary as a tree (Nister and Stewenius, 2006). Compared to the Video Google
approach, Nister and Stewenius demonstrated that their vocabulary tree approach allows
for much larger vocabularies, with results reported on vocabulary trees having 1M leaf
nodes, and an anecdotal mention of up to 16M leaf nodes. They use an inverted index
listing at each node in the vocabulary tree and show how to efficiently compute the term
vector similarity score over their tree structure. Their large-scale data set was constructed
by embedding 6376 ground-truth similar images into a large corpus generated from every
frame of seven full length movies, yielding over 1 million images. They report that queries
on this data set run in RAM on an 8GB machine and take approximately 1 second each.
Index creation took 2.5 days, which was dominated by feature extraction time. A benefit 
of the vocabulary tree approach is that, in principle, one can build the tree online and 
incrementally, which may make it suitable for tasks such as visual location awareness 
in mobile robots, where the robot incrementally builds its vocabulary as it explores its 
surroundings. 
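
The reason a tree supports such large vocabularies is that quantizing a descriptor only requires comparing it with the children of one node per level. The sketch below uses a tiny hand-built tree with hypothetical one-dimensional centers; a real vocabulary tree would be produced by hierarchical k-means over training descriptors.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class Node:
    def __init__(self, center, children=None, word=None):
        self.center = center
        self.children = children or []
        self.word = word  # visual word id, set on leaf nodes only

def quantize(root, descriptor):
    """Descend to a leaf, choosing the nearest child at each level.
    Returns (visual word id, number of distance computations)."""
    node, comparisons = root, 0
    while node.children:
        comparisons += len(node.children)
        node = min(node.children, key=lambda c: sq_dist(c.center, descriptor))
    return node.word, comparisons

def leaf_count(branch, depth):
    return branch ** depth

def lookup_cost(branch, depth):
    return branch * depth

# Hypothetical tree with branching factor 2 and depth 2 (four visual words).
root = Node(None, [
    Node((2.0,), [Node((1.0,), word=0), Node((3.0,), word=1)]),
    Node((8.0,), [Node((7.0,), word=2), Node((9.0,), word=3)]),
])
word, comparisons = quantize(root, (6.5,))
```

With branching factor b and depth L the tree has b^L leaves but a lookup costs only b*L distance computations; for example, b = 10 and L = 6 gives one million leaves at a cost of 60 comparisons per descriptor.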

Philbin et al. use a different approach by employing an approximate k-means clusterer 
during vocabulary creation (Philbin et al., 2007). Their approximate k-means (AKM)
algorithm uses a forest of randomized k-d trees. Lowe describes the use of a k-d tree to
approximate nearest neighbor computations of SIFT descriptors (Lowe, 2004), and others
have further developed the concept for increased speed and accuracy (Silpa-Anan and
Hartley, 2004, 2008; Moosmann et al., 2006, 2007; Muja and Lowe, 2008). Philbin et al. use an
implementation provided by Lowe, which reduces the complexity of a single k-means
iteration from O(Nk) to O(N log(k)), which is the same complexity as Nister's hierarchical
k-means (HKM) approach. Philbin et al. compare AKM with HKM, and claim equivalent
speed but significantly greater accuracy. They conclude that AKM suffers less from
quantization errors than HKM, and thus is superior.

Other scalable indexing methods have been explored, including Locality Sensitive Hashing
(LSH) (Ke et al., 2004; Andoni et al., 2006), dimensionality reduction of BoF vectors
(Chum et al., 2008; Jegou et al., 2009b), and Product Quantization (Jegou et al., 2010a).

Another issue for massive image databases is the potential for parallelization of the 
query process. Philbin et al. note that their approach, which uses a forest of k-d trees, could
be distributed to multiple servers. To our knowledge, however, no significant work has
demonstrated parallelized BoF-based image retrieval.



6.3.2. Post Query Processing. There have been several techniques put forth in the litera- 
ture for improving the performance of a Bag of Features image retrieval process by post- 
query processing of the result set. As mentioned previously, Sivic et al. employ a spatial 
consistency re-ranking process that alters the initial query results based on how well an 
estimated affine homography maps the feature points between the query and gallery image 
(Sivic and Zisserman, 2003; Philbin et al., 2007, 2008). While this method does improve
the results, it adds significant additional computation and may not be feasible for massive 
data sets. 



Another way to improve query results is through rank aggregation (Jegou et al., 2010b).
In rank aggregation, the query is performed multiple times using a separate vocabulary 
(index) for each query. The results are aggregated by, for example, taking the mean rank 
of the results. This increases the query time and index size. The separate queries can be 
processed in parallel to mitigate the time penalty, but there is still the cost of aggregation. 
From a storage perspective, multiple indexes increase the space requirements. 

A third post-query mechanism is query expansion, as presented by Chum et al.
(Chum et al., 2007). With query expansion, top-ranked results from the initial query are
resubmitted as additional queries in an attempt to increase recall at a given precision. Chum
et al. use spatial consistency to help prevent false positives from being used in the query 
expansion. Their results show this is a critical step, as query expansion without spatial 
verification performs worse than not using query expansion at all. The authors compare 
several different query expansion strategies, the best of which increases the mean Average 
Precision retrieval score on the Oxford data set by 20 percentage points over baseline. 
Information about the mean Average Precision metric and the Oxford data set can be found 
in Section 7.
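
One simple expansion strategy averages the query vector with the verified top results and queries again. The sketch below illustrates that idea only; it is not a reproduction of the cited experiments. It assumes BoF vectors stored as rows of a NumPy array, cosine similarity for ranking, and a caller-supplied `is_verified` predicate standing in for spatial verification; all names are hypothetical.

```python
import numpy as np


def cosine_rank(query_vec, gallery):
    """Gallery row indices sorted by decreasing cosine similarity to the query."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    g = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(g @ q), kind="stable")


def average_query_expansion(query_vec, gallery, is_verified, top_k=5):
    """Re-query with the mean of the query and its verified top-k results."""
    initial = cosine_rank(query_vec, gallery)
    # Only results that pass verification may contribute to the expanded query.
    verified = [int(i) for i in initial[:top_k] if is_verified(int(i))]
    if not verified:
        return initial
    expanded = (query_vec + gallery[verified].sum(axis=0)) / (1 + len(verified))
    return cosine_rank(expanded, gallery)
```

If `is_verified` accepts a false positive, its features are averaged into the new query, which is one way to see why unverified expansion can hurt rather than help.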

6.3.3. Generalization. Ideally, one might like to have a universal mechanism for image
retrieval that does not require training on a specific data set. In most of the surveyed BoF 
image retrieval approaches, the vocabulary is trained on a subset of the gallery images. 
While this subset and the gallery set used for testing may be separate, there is still the 
concern that the vocabulary may be optimized for use on the data set as a whole. How 
representative of the gallery data does the vocabulary training data need to be for effec- 
tive image retrieval? If it must be somewhat representative, then how does one train a 
vocabulary that will work well for billions of web images? 

While there is no conclusive response to those questions, there are some encouraging
results in the surveyed literature. Nowak et al. show that a codebook created from randomly
generated SIFT vectors can be effective for BoF-based image classification (Nowak
et al., 2006). Their study compared three codebooks evaluated over two databases: the Graz
object set (Opelt et al., 2004) and the KTH texture image set (Hayman et al., 2004). The three
codebooks used are: the random SIFT codebook, a codebook created from a subset of the 
Graz images, and a codebook created from a subset of the KTH images. Unsurprisingly, 
the KTH codebook yields the best classification accuracy on the KTH images and the Graz 
codebook is best on the Graz images. More interesting, perhaps, is that the random code- 
book has approximately the same error rates on both sets, increasing the error rate by 5-10 
percentage points over the set-specific codebook. Also of note is that the Graz codebook 
scores second-best on the KTH data and, similarly, the KTH codebook is second-best on 
the Graz data. This indicates that a codebook generated on a "real" image space performs 
better than a purely random codebook from the same feature space. While random codebooks
have not been evaluated for image retrieval, it is reasonable to believe the effects
reported for image classification might carry over.

Supporting the notion that it may be possible to create a generic, or universal, codebook
are anecdotal results reported by the Video Google project (Sivic and Zisserman, 2003).
They report that image retrieval was effective even when using a codebook generated from
Run Lola, Run to index and retrieve images from the movie Groundhog Day.



7. Evaluation 

This section provides an overview of the common methods for evaluating the perfor- 
mance of Bag of Features systems. A summary is presented of the quantitative metrics 
used to evaluate performance and the common benchmark data sets. Finally, this section 
concludes with the challenges in comparative evaluation and what could be done to im- 
prove the situation. 



7.1. Performance Metrics. One of the most common performance measures for image 
retrieval is the precision-recall curve. Precision is defined as the ratio of true positives 
returned by a query over the total number of results returned. Recall is the ratio of true 
positives returned by a query over the total number of true matches possible in the gallery. 
It has been said that state-of-the-art BoF-based image retrieval methods typically have high
precision at low recall (Chum et al., 2007), meaning that the top results are highly relevant
to the query, but many other relevant images are not returned. From the precision-recall
curve, one can define the Average Precision (AP) as the area under the curve. A perfect
precision-recall curve would have a perfect precision (1.0) at all recall levels, thus the
maximum AP is 1.0. A variant of AP used by Philbin et al. (2007) is the
mean Average Precision (mAP), which is the AP averaged over 5 different query regions
from the same landmark on the Oxford data set. An Equal Error Rate (EER) for precision-
recall curves is also used by some, and is the value on the curve where precision equals
recall.
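
These definitions translate directly into code. The sketch below assumes each query result is given as a ranked list of relevance flags plus the number of relevant images in the gallery. It approximates the area under the curve by averaging the precision at each relevant hit, which is one common convention; published evaluation scripts may integrate the curve differently (for example with a trapezoidal rule). The function names are illustrative.

```python
def precision_recall_points(ranked_relevance, num_relevant):
    """(precision, recall) after each returned result, in rank order."""
    points, hits = [], 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += 1 if is_relevant else 0
        points.append((hits / k, hits / num_relevant))
    return points


def average_precision(ranked_relevance, num_relevant):
    """Step-wise approximation of the area under the precision-recall curve."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision at the rank of this relevant result
    # Relevant images that are never returned contribute zero precision.
    return total / num_relevant


def mean_average_precision(queries):
    """Mean AP over (ranked_relevance, num_relevant) pairs, one pair per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)
```

For a query with two relevant images returned at ranks 1 and 3, the precisions at those hits are 1 and 2/3, so the AP under this convention is 5/6.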

Sponsored by the U.S. National Institute of Standards and Technology (NIST), the 
TRECVID community uses Inferred Average Precision (InfAP). InfAP is an approxima- 
tion to AP used when the dataset is too large or dynamic to allow for complete relevancy 
judgments on ranked results. InfAP treats the AP measure as the expectation over a random 
experiment, and is described in (Yilmaz and Aslam, 2006).

Another performance measure proposed for image retrieval is the Average Normalized 
Rank (ANR) (Sivic and Zisserman |2003). Given a known number of images that should 
ideally be retrieved for a given query, ANR computes a measure of the actual ranking 
versus the ideal, normalized by both the number of relevant images and also the size of the 
gallery. ANR is computed as follows, where $N$ is the number of gallery images, $N_{rel}$ is the
number of relevant images, and $R_i$ is the actual rank of the i'th relevant image:

\[ \mathrm{ANR} = \frac{1}{N \, N_{rel}} \left( \sum_{i=1}^{N_{rel}} R_i - \frac{N_{rel}(N_{rel}+1)}{2} \right) \tag{3} \]

To help understand Equation 3, consider an example of five relevant images returned
from a database of 1000 with the following ranks: {3,4,8,100,400}. Note that the order of
the five images does not matter: one of them is ranked 3rd, one is 4th, one is 8th, etc. The
perfect set of rankings would be: {1,2,3,4,5}. The term $N_{rel}(N_{rel}+1)/2$ simply sums the
values 1 to $N_{rel}$, representing the perfect score. In this example,
$\mathrm{ANR} = \frac{1}{1000 \times 5}\left((3 + 4 + 8 + 100 + 400) - 15\right) = \frac{500}{5000} = 0.10$.
The interpretation is that the average ranking of a relevant image
in this query was in the top 10%. The smaller the number, the better. A perfect ANR score
is zero.
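
The worked example can be checked with a few lines of code. This is a minimal sketch of the ANR computation as described in the text (sum of actual ranks, minus the ideal sum, normalized by gallery size and number of relevant images); the function and argument names are chosen for the example.

```python
def average_normalized_rank(ranks, gallery_size):
    """ANR for one query: 0.0 is perfect, larger values are worse."""
    num_relevant = len(ranks)
    # Sum of the ideal ranks 1..num_relevant.
    ideal_sum = num_relevant * (num_relevant + 1) / 2
    return (sum(ranks) - ideal_sum) / (gallery_size * num_relevant)


example = average_normalized_rank([3, 4, 8, 100, 400], gallery_size=1000)
perfect = average_normalized_rank([1, 2, 3, 4, 5], gallery_size=1000)
```

Because only the sum of the ranks enters the computation, reordering the list of ranks leaves the score unchanged.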

One final performance measure discussed in the literature is the Nister-Stewenius (NS) 
score, developed for use with the NS data set. The NS data set has 4 images each of 2,550 
objects. The NS score indicates the average number of the four images that are returned 
in the results for a given query. Note that in this setup, each of the four images is used 
as the query, and is also still part of the gallery. One expects the identical image will be 
returned in the query, thus while the minimum NS score is technically zero, only a very 
poor algorithm would fail to find an exact match. So the NS score effectively ranges from 
1 to 4, where 4 is a perfect result of all four images being returned when any one of them 
is given as the query. 
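
A small sketch of the score follows. It assumes that only the top four results of each query are counted and that a mapping from image id to object id is available; both the cutoff parameter and the data layout are assumptions made for illustration.

```python
def ns_score(results_by_query, object_of, top_n=4):
    """Average number of same-object images among the top_n results of each query."""
    counts = []
    for query_id, ranked_ids in results_by_query.items():
        target = object_of[query_id]
        # The query image is part of the gallery, so it normally counts as one hit.
        counts.append(sum(1 for image_id in ranked_ids[:top_n] if object_of[image_id] == target))
    return sum(counts) / len(counts)
```

With this convention a system that only ever finds the query image itself scores 1.0, matching the effective range of 1 to 4 noted above.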

7.2. Comparative Evaluation. While comparing image classification approaches is relatively
straightforward given a common benchmark data set, there are challenges in trying
to compare BoF image retrieval methods. The key challenge for comparative evaluation 
is the lack of consensus on performance measures, data sets, and evaluation protocols. 
NIST has been running the TRECVID challenge for video information retrieval for nearly 
a decade. Perhaps the reason why leading BoF image retrieval systems do not present 
TRECVID results is that the problem is slightly different. With the TRECVID information 
retrieval task, one must find those video frames that contain certain pre-defined concepts. 
With Video Google and its more contemporary progeny, the task is to find those images 
that best match a given image or image region. The TRECVID challenge is more of an 
image classification or object detection challenge in a large scale corpus, which favors 
SVM-based methods that have been trained to recognize a finite set of objects/concepts. 
The Video Google style task is more open-ended and, perhaps, more general - one cannot 
enumerate the concepts or objects that a user might want to retrieve from video data. Instead,
the user is asked to provide an example, and the most relevant matches are returned.

Nister and Stewenius evaluated their approach on a large scale (1M+ images) corpus, 
yet the bulk of the images were from copyright protected movies. The ground-truth subset, 
identified earlier as the NS data set, is approximately 10K of the 1M images, and can
be freely downloaded. To reproduce their reported results on the large scale data set, one 
would have to extract the image frames for all seven movies identified in their paper, which 
is time-consuming and possibly in violation of digital rights management laws. 

Oxford researchers have developed their own large scale benchmark consisting of 5K 
ground-truth images, gathered from Flickr, that show various Oxford landmark buildings. 
This Oxford 5K data set is embedded in approximately 1M Flickr images gathered using
the top 450 most popular tags. The Oxford 5K is available for download, but not the entire 
large scale corpus. To reproduce their results, one has to build a new data set from the 
top Flickr tags, which will differ from the Oxford set because the source of the data is not 
static. So exact reproduction of results using either the NS or Oxford large scale data sets 
is impossible until the entire original data sets are made available. 

Another issue with the Oxford data set is the focus on architectural landmarks, which 
represent rigid shapes with many coplanar surfaces. Evaluations on this data set, using a 
vocabulary generated from a similar set of images, may show a positive performance bias 
over a more general setup. Philbin et al. (2008) report results on a vocabulary trained on
Paris architectural landmarks used for image retrieval on the Oxford images. In general,
it is unreasonable to expect that the visual vocabulary would be generated by images of 
the same thematic material as the query. Furthermore, verifying the query results using 
approximate affine homographies is likely to work well for query targets containing rigid 
objects with many coplanar features. 

A somewhat newer data set is the INRIA Holidays images, also designed for the evaluation
of large scale image retrieval systems (Jegou et al., 2008). Like the Oxford data,
INRIA uses a set of ground truth images embedded in distractors culled from Flickr. To 
encourage comparative evaluation, the INRIA group provides data for the pre-computed 
features (Hessian Affine detector with SIFT descriptors) on 1 million Flickr images. The 
feature data is 235 gigabytes. This data allows comparative evaluation of million-image- 
scale image retrieval only for those approaches that use the same features. 

Ideally, an evaluation benchmark would provide a variety of query images and would 
be able to mimic common use-cases for this technology. A technique that works best 
for searching an action movie for car wrecks and explosions may be less than ideal for 
organizing your vacation photos - unless your vacations are substantially more exciting 
than ours! An evaluation benchmark should specify a neutral data set, testing protocol, 
and performance measures. Due to the costs in collecting, managing, and hosting terabyte-
scale data sets, it may be best handled by a national or international standards body.

8. Challenges 

Even though Bag of Features representations have proven powerful for image classi- 
fication and image retrieval, there remain challenges in applying them to other tasks. In 
this section, we describe a few of the limitations of the standard BoF representation, with 
the expectation that novel variants may yet emerge that mitigate these limitations while 
retaining the key strengths of the paradigm. 

8.1. Spatial Information. The lack of spatial information in traditional BoF represen- 
tations seems to make them poor choices for systems that localize objects in images or 
describe relationships among objects. Term vectors pool information from across an image,
making it difficult to extract relational concepts such as "a person standing next to
a streetlight." Researchers have explored modifying BoF representations to encode some 
spatial information, for example Lazebnik et al.'s Spatial Pyramid Matching method discussed
in Section 5. More spatial information might help with spatial tasks, such as finding
small objects within cluttered scenes, or recognizing relations among objects. Stronger en- 
codings of spatial information, however, move away from the BoF paradigm and toward 
part-based or constellation models, with the risk of losing the simplicity and performance
of BoF methods. 
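
The loss of spatial information is easy to demonstrate. In the toy sketch below, each feature is a hypothetical (x, y, visual word) tuple; the term vector is built from the words alone, so two images with the same words in different arrangements are indistinguishable.

```python
from collections import Counter


def bof_histogram(features):
    """Count visual words; the (x, y) position of each feature is discarded."""
    return Counter(word for _x, _y, word in features)


# Same three visual words, arranged differently in two hypothetical images.
person_next_to_light = [(10, 50, "head"), (12, 90, "torso"), (40, 30, "lamp")]
scrambled_layout = [(90, 5, "torso"), (3, 3, "lamp"), (55, 70, "head")]
same_representation = bof_histogram(person_next_to_light) == bof_histogram(scrambled_layout)
```

Any relation such as "next to" would have to be recovered from information that the histogram no longer contains.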



8.2. Semantic Meaning. While some visual words may represent distinct object parts, 
such as a headlight on a car, in practice most visual words have no simple linguistic de- 
scription. This lack of inherent semantic meaning in a BoF representation poses a chal- 
lenge for certain tasks, such as retrieving images from keywords or generating natural 
language descriptions from images. The nascent ImageNet database of Deng et al. (2009)
may be one approach to bridging the gap between visual and linguistic terms,
although other, yet-to-be-invented techniques may also be needed.



8.3. Performance Evaluation. The observation that BoF representations pool information
from across an image does more than make it difficult to localize targets; it also makes
it difficult to assess what BoF systems are actually recognizing. A good example comes
from Pinto et al. (2009). They built a system that achieves state-of-the-art face recognition
performance on the complex Labeled Faces in the Wild (LFW) data set. LFW contains 
images of celebrity faces culled from news photographs, and is considered a difficult face 
recognition data set because of the uncontrolled illumination and poses, and because of 
the complex backgrounds that surround the faces. Pinto et al. achieved their success using 
densely-sampled local image features with Gabor-based feature descriptors and a Multi- 
ple Kernel Learning (MKL) algorithm to match faces. Although not precisely a Bag of 
Features model, it is similar to the BoF image classification approaches that use dense 
sampling and SVM classifiers. In a cautionary note, however, Pinto et al. suggest that their
system's performance may have little to do with recognizing faces, and may instead be the 
result of exploiting low-level similarities among image backgrounds. They show that the 
same method fails to be robust to modest amounts of variation when tested on synthetic 
data. 

Pinto et al.'s observation is similar to one made by Lazebnik et al. (2006).
Lazebnik et al. note that certain categories in the Caltech101 benchmark, such as minarets,
have corner artifacts induced from artificial image rotation. These artifacts are highly
discriminative and stable cues that generate high performance on the Caltech data set,
but do not generalize to unrotated pictures of minarets. Indeed, one would expect rotated 
images of other objects to be labeled as minarets! The challenge, therefore, is to evaluate 
BoF methods in such a way as to determine what, exactly, the system is recognizing, in 
order to know whether performance will generalize to future data sets. 



9. Conclusion 

The Bag of Features representation is notable because of its relative simplicity and 
strong performance in a number of vision tasks. Nowak et al. demonstrated compelling 
state-of-the-art performance on the 2005 PASCAL Visual Object Classes Challenge
(Nowak et al., 2006). Nister and Stewenius developed breakthrough scalability by demonstrating
image retrieval on a million-image data set (Nister and Stewenius, 2006). Continued
research has addressed quantization issues, improved feature detectors and descriptors,
and developed more compact representations and scalable indexing schemes. 

Yet challenges remain. Comparative evaluation of large-scale image retrieval tasks is 
difficult. This is ironic, because it is the success of the BoF representation that has led to 
the need for huge image data sets, the distribution of which is time- and cost-prohibitive.
The sampling strategy remains an open issue. Is it better to use keypoint detectors or 
sample using a scale-space grid? Are SIFT-like descriptors the final answer to the features, 
or should color and other aspects be further explored? BoF methods provide little semantic 
description of image contents. This lack of semantic meaning makes it difficult to integrate 
BoF-based image retrieval with keyword text queries. 

Finally, a BoF approach may be less suitable than other techniques for object detection 
and localization. One can imagine a "Where's Waldo" challenge, where a BoF method 
would misclassify an image as containing Waldo because the localized features have many 
horizontal red and white stripes. Because BoF representations do not identify and local- 
ize objects in images, object recognition performance can be questionable. An image is 
indicated as having an object because its "bag" contains the same type and distribution of 
features as other images containing the object. With no information on the arrangement of 
the features, many false detections could be expected in real-world applications. 

As of this survey, BoF research is still very much an active field, and advances are pub- 
lished at major conferences every year. The technology has advanced considerably in the 
past decade, and we hope to see continued exploitation of this powerful and computation- 
ally cheap representation for a variety of applications. 